[spark] support distributed execution of vector search on spark by Stefanietry · Pull Request #8108 · apache/paimon

Stefanietry · 2026-06-03T13:15:08Z

Purpose
Purpose: Vector search currently runs on a single node, inside the Spark driver, which can become a performance bottleneck on large datasets. This PR adds a distributed execution path.
Linked issue: #8107

Tests
Adds a distributed vector search test to org.apache.paimon.spark.SparkMultimodalITCase#testVector, enabled through the vector-search.distribute.enabled parameter.

JingsongLi · 2026-06-03T13:31:50Z

+        Broadcast<RoaringNavigableMap64> preFilterBroadcast =
+                preFilter == null ? null : engineContext.broadcast(preFilter);
+
+        SerializableFunction<List<byte[]>, Optional<byte[]>> task =


This distributed path returns java.util.Optional<byte[]> from the Spark task and then collects it back to the driver. java.util.Optional is not Serializable in Java 8, so Spark will fail serializing the task result with NotSerializableException once this branch actually runs. Could we return a serializable value instead, for example byte[] with null meaning empty, or a small serializable wrapper?

Thanks for pointing this out. I had only considered Kryo before; fixed as suggested.
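
For readers following the thread, here is a minimal sketch of the serialization concern, using only the JDK. The class and method names (TaskResultSerializationSketch, javaSerializable, searchGroup) are hypothetical and not part of Paimon or Spark; the sketch only shows that Java serialization rejects java.util.Optional while a byte[] result, with null standing for "empty", passes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Optional;

public class TaskResultSerializationSketch {

    /** Returns true if Java serialization accepts the value, false on NotSerializableException. */
    static boolean javaSerializable(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical task body: null stands for "no result" instead of Optional.empty(). */
    static byte[] searchGroup(boolean hasHits) {
        return hasHits ? new byte[] {1, 2, 3} : null;
    }

    public static void main(String[] args) {
        // Optional does not implement java.io.Serializable, so this is rejected.
        System.out.println(javaSerializable(Optional.of(new byte[] {1}))); // false
        // A byte[] result serializes, and so does null for the empty case.
        System.out.println(javaSerializable(searchGroup(true)));  // true
        System.out.println(javaSerializable(searchGroup(false))); // true
    }
}
```

The caller then has to treat a null element in the collected results as an empty group, which is the trade-off of dropping the Optional wrapper.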

JingsongLi · 2026-06-03T13:32:16Z

        assertThat(df.columns()).hasSize(4);
        rows = df.collectAsList();
        assertThat(rows).hasSize(5);
+        spark.sql("set spark.paimon.vector-search.distribute.enabled = true;");


This assertion does not seem to exercise the new Spark-distributed path: the table only has a small number of vector splits, while SparkVectorReadImpl falls back to super.read unless splits.size() >= global-index.thread-num * 2 (default 64 splits). Because of that, the serialization/distributed execution code can be broken and this test would still pass. Could we force the distributed branch in this test, for example by setting spark.paimon.global-index.thread-num=1 or by creating enough index shards/splits?

Thanks for pointing this out; fixed in the latest version.
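
A small sketch of the fallback condition quoted in the review may help show why the original test never reached the distributed branch. The class and method names are hypothetical, the thread-num default of 32 is only inferred from the reviewer's "default 64 splits" remark, and the split count of 5 is an illustrative stand-in for "a small number of vector splits"; none of these were checked against the Paimon source.

```java
public class DistributeThresholdSketch {

    /** Mirrors the condition from the review: distribute only when splits >= threadNum * 2. */
    static boolean useDistributedPath(int splitCount, int threadNum) {
        return splitCount >= threadNum * 2;
    }

    public static void main(String[] args) {
        int assumedDefaultThreadNum = 32; // assumption: inferred from "default 64 splits"
        int smallTableSplits = 5;         // assumption: illustrative split count for a small test table

        // With the assumed default, a small table falls back to the local path.
        System.out.println(useDistributedPath(smallTableSplits, assumedDefaultThreadNum)); // false
        // Setting thread-num to 1 lowers the threshold to 2 splits, forcing the distributed path.
        System.out.println(useDistributedPath(smallTableSplits, 1)); // true
        // Creating enough splits reaches the threshold without touching thread-num.
        System.out.println(useDistributedPath(64, assumedDefaultThreadNum)); // true
    }
}
```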

JingsongLi · 2026-06-03T13:33:34Z

+        return dataOutputSerializer.getCopyOfBuffer();
+    }
+
+    public ScoredGlobalIndexResult deserialize(byte[] data) throws IOException {


This helper cannot round-trip an empty ScoredGlobalIndexResult. serialize() writes only scoreSize=0 for scored results whose bitmap is empty, and the existing deserializer interprets scoreSize == 0 as a plain GlobalIndexResult; this deserialize(byte[]) method then fails the instanceof ScoredGlobalIndexResult check. In the distributed reader, a split group can legitimately produce an empty scored result when the scalar pre-filter excludes all rows in that group, so this can make filtered distributed searches fail even though the local path handles empty optionals. We probably need an explicit scored/non-scored marker in the serialization format, or avoid serializing empty scored results as successful task results.

This method only handles the case where the ScoredGlobalIndexResult is not null. In the null case, the task returns null directly (the previous version returned Optional.empty) and skips serialization. See org.apache.paimon.spark.read.SparkVectorReadImpl#read for details.

If you want org.apache.paimon.globalindex.GlobalIndexResultSerializer#deserialize(org.apache.paimon.io.DataInputView) to distinguish between GlobalIndexResult and ScoredGlobalIndexResult,
I'd be happy to open a separate issue and support that later. Would that work?
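
As a reference point for that possible follow-up, here is a rough sketch of what an explicit scored/non-scored marker could look like. It is not the Paimon format: Result is a hypothetical stand-in for GlobalIndexResult and ScoredGlobalIndexResult, the layout (one marker byte, a count, row ids, then scores) is invented for illustration, and only JDK streams are used. The point is that an empty scored result stays distinguishable from an empty plain one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ScoredMarkerFormatSketch {
    static final byte PLAIN = 0;
    static final byte SCORED = 1;

    /** Hypothetical stand-in for an index result; scores == null means a plain result. */
    static final class Result {
        final long[] rowIds;
        final float[] scores;

        Result(long[] rowIds, float[] scores) {
            this.rowIds = rowIds;
            this.scores = scores;
        }

        boolean isScored() {
            return scores != null;
        }
    }

    static byte[] serialize(Result r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            // The marker is written first, so the kind no longer depends on the score count.
            out.writeByte(r.isScored() ? SCORED : PLAIN);
            out.writeInt(r.rowIds.length);
            for (long id : r.rowIds) {
                out.writeLong(id);
            }
            if (r.isScored()) {
                for (float s : r.scores) {
                    out.writeFloat(s);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Result deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            boolean scored = in.readByte() == SCORED;
            int n = in.readInt();
            long[] ids = new long[n];
            for (int i = 0; i < n; i++) {
                ids[i] = in.readLong();
            }
            float[] scores = null;
            if (scored) {
                scores = new float[n];
                for (int i = 0; i < n; i++) {
                    scores[i] = in.readFloat();
                }
            }
            return new Result(ids, scores);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Result emptyScored = deserialize(serialize(new Result(new long[0], new float[0])));
        Result emptyPlain = deserialize(serialize(new Result(new long[0], null)));
        System.out.println(emptyScored.isScored()); // true
        System.out.println(emptyPlain.isScored());  // false
    }
}
```

Changing a real on-disk or on-wire format would also need a versioning story, which this sketch leaves out.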

JingsongLi

LGTM if tests passed.

JingsongLi · 2026-06-05T03:52:07Z


-    abstract boolean preserveOnDelete();
+/** SPI for engine specific {@link VectorSearchBuilder} creation. */
+public interface VectorSearchBuilderProvider {


Can you avoid introducing this interface? You can just modify PaimonBaseScan to customize the vector search logic.

Like this? val vectorSearchBuilder = if (CoreOptions.fromMap(table.options).vectorSearchDistributeEnabled()) { new SparkVectorSearchBuilderImpl(table) } else { table.newVectorSearchBuilder() }

The previous implementation planned to integrate construction of the VectorSearchBuilder into org.apache.paimon.table.InnerTable#newVectorSearchBuilder. If the customization in PaimonBaseScan shown above is acceptable, I have already made that change in the latest version.

JingsongLi · 2026-06-05T03:52:28Z

+import java.util.List;
+
+/** Factory for {@link VectorSearchBuilder}. */
+public class VectorSearchBuilderFactory {


Can you avoid introducing this interface? You can just modify PaimonBaseScan to customize the vector search logic.

As mentioned above, it has been removed.

JingsongLi reviewed Jun 3, 2026

Stefanietry force-pushed the opt_vector_search_on_spark branch from 93353be to e8d7694 Compare June 4, 2026 03:37

JingsongLi reviewed Jun 4, 2026

Stefanietry force-pushed the opt_vector_search_on_spark branch 3 times, most recently from ef31bad to dbc62f3 Compare June 4, 2026 07:09

Stefanietry closed this Jun 4, 2026

Stefanietry reopened this Jun 4, 2026

Stefanietry force-pushed the opt_vector_search_on_spark branch 3 times, most recently from 8de0c54 to 1d7b368 Compare June 4, 2026 11:32

JingsongLi reviewed Jun 5, 2026

[spark] support distributed execution of vector search on spark

7994022

Stefanietry force-pushed the opt_vector_search_on_spark branch from 1d7b368 to 7994022 Compare June 5, 2026 04:58

Stefanietry wants to merge 1 commit into apache:master from Stefanietry:opt_vector_search_on_spark
